Three learning stages and accuracy–efficiency tradeoff of restricted Boltzmann machines

Restricted Boltzmann Machines (RBMs) offer a versatile architecture for unsupervised machine learning that can in principle approximate any target probability distribution with arbitrary accuracy. However, the RBM model is usually not directly accessible due to its computational complexity, and Markov-chain sampling is invoked to analyze the learned probability distribution. For training and eventual applications, it is thus desirable to have a sampler that is both accurate and efficient. We highlight that these two goals generally compete with each other and cannot be achieved simultaneously. More specifically, we identify and quantitatively characterize three regimes of RBM learning: independent learning, where the accuracy improves without losing efficiency; correlation learning, where higher accuracy entails lower efficiency; and degradation, where both accuracy and efficiency no longer improve or even deteriorate. These findings are based on numerical experiments and heuristic arguments.

Labels of equations, figures, tables, and references in these Supplementary Notes are prefixed by a capital letter "S" (e.g., Fig. S1, Eq. (S3)). Any plain labels (e.g., Fig. 1, Eq. (3)) refer to the corresponding items in the main text. S1. TRAINING DETAILS With the exception of Fig. 3e, the data presented in the main text were obtained from RBMs trained with a stochastic gradient descent scheme based on the ideal gradient descent updates from Eq. (4) of the main text and utilizing contrastive divergence (CD, see Refs. [S1, S2]) or persistent contrastive divergence (PCD, see Ref. [S3]) to approximate the model averages. Concretely, the training dataset S = {x (1) , . . . ,x (|S|) } was partitioned randomly into s := |S| B minibatches S 1 , . . . , S s of size B at the beginning of each epoch t.
We recall that Markov chains of the form S1) are employed to assess the model distribution approximately (cf. Eq. (5) of the main text). We denote a particular (random) realization of x (n) and h (n) for a chain initiated at a (fixed) x (0) =x byx (n) (x) andĥ (n) (x), respectively. In CD, the updates (4) are then approximated as for r = 1, . . . , s. In essence, the model averages in (4) are thus approximated by empirical averages over samples from Markov chains (S1) that are initialized with a training sample x (0) =x ∈ S. If ∆ θ is sufficiently small, such x (0) from S may already be a reasonable approximation for a sample fromp θ (x), and the chain generates a new (but correlated) sample. In the beginning of training, whenp θ (x) is still far from p(x), such an initialization of the chains is less justified, but it is found to work in practice [S2, S4, S5], not least because the mixing and autocorrelation times of the chains are typically small as well in this case. In PCD, the updates (4) are approximated as for r = 1, . . . , s, where Q(0) is a set of random, independent configurations of the visible units and L := |Q(0)|. We always use L = B. Hence the model averages are approximated by empirical averages over samples from Markov chains (S1) that are initialized with (or, in other words, continued from) samples of the previous update step, i.e., the chains are persistent. Note, however, that the model distribution used to generate (or advance) the chains changes from step to step as the model parameters θ change. If these parameter updates are sufficiently small and the initial configurations x ′ ∈ Q(0) emulate samples from the initial modelp θ(0) (x), the chains approximately reflect the gradually evolving stationary model distribution throughout the training process. However, subsequent samples are generally not independent, especially if n CD ≲ τ θ . The total number of training epochs in Figs. 2-4 of the main paper as well as in this Supplementary Information varies between 2 × 10 4 and 3 × 10 6 , depending on the time needed until improvement of the accuracy could no longer be observed and extending reasonably far beyond it to capture the degradation regime. As mentioned in the main text, the machines were initialized by drawing the parameters θ k (0) independently from a normal distribution N (µ, σ) of mean µ and standard deviation σ, namely w ij (0) ∼ N (0, 10 −2 ) and a i (0), b j (0) ∼ N (0, 10 −1 ) (Figs. 2,3,and 4b,c) or a i = b j = 0 (Fig. 4e). Figs. 2 and 3 show averages over 5 independent repetitions of the experiment for each hyperparameter configuration. Fig. 4 shows results for single machines.
For the data in Fig. 3e, we utilized-as described in the main text-the full target distribution p(x) and the exact n-step CD model distributionp (n) θ or the full model distributionp θ (x) for the expectation values. Moreover, we employed the continuous-time limit of the update equations (4). Hence the evolution of the weights is governed by the differential equationṡ where the dots indicate derivatives with respect to t and . We then integrated Eqs. (S4) numerically (starting from random initial conditions as before) using Mathematica's NDSolve routine.

A. Transverse-field Ising chain
The transverse-field Ising chain (TFIC) with M sites is defined by the Hamiltonian with periodic boundary conditions, σ γ i+M = σ γ i , where σ γ i (γ = x, y, z) are the Pauli matrices acting on site i. The corresponding spin raising and lowering operators are σ ± i := 1 2 (σ x i ± iσ y i ). The Hamiltonian can be diagonalized by the following sequence of transformations: the Jordan-Wigner transformation with fermionic (bosonic) Matsubara frequencies k if the total particle number i c † i c i is even (odd), and the Bogoliubov transformation with u k := (ε k + α k )/ω k , v k := iβ k /ω k , α k := −2J(g + cos k), β k := 2J sin k, ε 2 k := α 2 k + β 2 k , ω 2 k := 2ε k (ε k + α k ); see, for example, Ref. [S6]. The resulting Hamiltonian is in the sector with an even number of particles, to which the ground state belongs. This ground state can be constructed as from the state |↓ · · · ↓⟩ with all spins down in the σ z basis [S6]. Moreover, by adjusting the global phase, it can be written such that ψ(x) := ⟨x 0 · · · x M −1 |ψ⟩ is real-valued and nonnegative when |x 0 · · · x M −1 ⟩ is a basis state in the σ z or σ x bases. For our quantum-state tomography task of the ground-state wave function from measurements in either of the two bases, we can therefore take the target distribution as p(x) := ψ(x) 2 and do not need additional modifications of the RBM model to facilitate the phase reconstruction [S7].
Tab. S1 lists the entropy S(p) := − x p(x) ln p(x) and total correlation C tot (p) (cf. Eq. (8) of the main text) of the target distribution obtained for various values of g.

B. Hook-pattern images
We consider L × L images x = (x i1,i2 ) L−1 i1,i2=0 whose pixels x i = x i1,i2 are either black (x i = 0) or white (x i = 1). We denote the set of all 2 L×L of these images by K.
The characteristic feature of the target distribution p(x) is a "hook" pattern comprised of a total of 15 pixels, cf. Fig. 3a. We assume periodic boundary conditions (x i,j = x i+L,j = x i,j+L ) and denote the max-distance on the image grid by Similarly, for a set of pixel positions (sites) I and a single site j, define d ∞ (I, j) := min{d ∞ (i, j) : i ∈ I}. An image x then shows the "hook" pattern at site i if the pixels at sites I 0 (i) := {(i 1 , i 2 ), (i 1 − 1, i 2 ), (i 1 , i 2 + 1)} are white and the pixels at sites I 1 (i) := {j : d ∞ (I 0 (i), j) = 1} are black. In other words, x j = 1 for all j ∈ I 0 (i) and x j = 0 for all j ∈ I 1 (i). The remaining L 2 − 15 pixels at sites I ≥2 (i) := {j : d ∞ (I 0 (i), j) ≥ 2} can take arbitrary values. We denote the set of all images with a "hook" pattern at an arbitrary site by K H . In the main text, we choose L = 5; examples are shown in Fig. 3a. Note that there are a total of |K H | = L 2 × 2 L 2 −15 = 25 600 images exhibiting the pattern, which is a fraction of |K H |/|K| = 25 × 2 −15 ≈ 0.08 % of all 5 × 5 images.
The target distribution picks a random location i = (i 1 , i 2 ) for the hook pattern and activates all remaining pixels in I ≥2 (i) with probability q = 1 10 . Hence The entropy is S(p) ≈ 6.47 and the total correlation is C tot (p) ≈ 4.53.
The one-dimensional, smaller distributions from Fig. 3d are structurally similar, but involve a core pattern of white pixels at sites I 0 (i) of only one (M = 4) or two (M = 5) pixels, surrounded by one-site boundaries I 1 (i) of black pixels in either direction, with the remaining pixel in I ≥2 (i) being arbitrary again.

C. Digit images
and denote the set of all such images by K.
The target distribution involves images showing one of ten patterns representing the digits 0 through 9 (cf. Fig. 4a of the main text). The digit patterns consist of a core block of fixed black or white pixels (black or white in Fig. 4a) as well as a boundary (gray-shaded in Fig. 4a) which may either be represented by black pixels or by the image frame (i.e., the gray pixels may lie "out of bounds"). No periodic boundary conditions are assumed. The remaining pixels outside of the respective pattern may take arbitrary values. Due to the different sizes of the digit patterns, the number of possible images for each pattern is different in general. We denote the set of all images with a "k" pattern by K k . For H = 7 and W = 5, which is our choice for the data shown in Fig. 4a-c of the main text, the resulting cardinalities of the sets K k are summarized in Tab. S2.
The target distribution selects a pattern k uniformly at random, p(x ∈ K k ) = 1 10 1 l K l (x). The selected pattern is then placed at a random position i = (i 1 , i 2 ) with uniform probability p(x ∈ K k,i | x ∈ K k ), where K k,i is the set of all images showing pattern k at position i. Due to the different pattern shapes, again, the admissible positions are generally different for different patterns. Denoting the set of all pixel sites that are not part of the pattern k at position i byĪ(k, i), the remaining pixels represent noise and are white independently with probability q = 1 10 , i.e., p(x j = 1|j ∈Ī(i, k)) = q. The total probability of a given image x is thus (S15) Note that an image can "accidentally" have patterns at multiple positions; for the image sizes we adopted, in particular, this can happen if one of them is the "1" pattern.
The entropy of this distribution is S(p) ≈ 8.35 and the total correlation is C tot (p) ≈ 11.67.
For the empirical loss measure∆

D. MNIST
We use the MNIST dataset of 28 × 28-pixel images of handwritten digits with its standard splitting into |S| = 60 000 training and |T | = 10 000 test images [S8]. Each subset contains equal fractions of representations for each digit. The grayscale pixel values z i ∈ {0, 1, . . . , 255} are preprocessed as x i = ⌊ zi 128 ⌋ to obtain a binary dataset. The entropies of the thus-obtained training and test datasets are S(p( · ; S)) ≈ 11.00 and S(p( · ; T )) ≈ 9.21, and their total correlations are C tot (p( · ; S)) ≈ 195.0 and C tot (p( · ; T )) ≈ 196.5, respectively. Kullback-Leibler divergence∆ (T,S) σ := DKL(p( · ; T )||pσ( · ; S)) between the empirical distribution of the test datap(x; T ) and the Gaussian-smoothened empirical distributionpσ(x; S) of the training data as a function of the smoothening width parameter σ. The value of σ for which∆ (T,S) σ becomes minimal is adopted in Fig. 4 to calculate the empirical loss measure∆ Similarly as before, the parameter σ for∆

S3. INITIALIZATION SCHEMES
As argued in the main text, the most natural way to initialize the RBM parameters (θ k ) = (w ij , a i , b j ) is to assign small random values to them if no information about the target distribution p(x) is available. In the main text, we thus sample the initial θ k from independent normal distributions N (µ, σ) with vanishing mean µ = 0 and small (or vanishing) standard deviation σ. In Figs. S2 and S3, we compare the resulting relationship between ∆ θ and τ θ for other initialization schemes.
One piece of information about the target distribution that is easily accessible is the approximate value of the marginal probabilities p i (x i ) of the visible units. Hence Hinton [S2] suggests to initialize the visible-unit biases as In the numerical examples from the second columns of Figs. S2 and S3, we cap the so-obtained a i at 2 in absolute value. For the weights and hiddenunit biases, Ref. [S2] suggests using w ij ∼ N (0, 0.01) and b j = 0. Following this scheme, one can shorten the independent-learning period as can be seen in the second columns of Figs. S2 and S3. Nevertheless, we observe the same learning characteristics as for the fully random initialization in the correlation-learning and degradation regimes.
By accident or deliberation, the initial model distribution may already exhibit noticeable but spurious correlations as well. This is examplified in the last three columns of Figs. S2 and S3. Such initial correlations may arise, for instance, if the weights are chosen too large (third columns) or if parameters from a pre-trained machine using a different target distribution are adopted (fourth and fifth columns). In the case of such spurious initial correlations, the machine typically starts further away from the lower bound (7). As training progresses, however, the bound is approached by decreasing both ∆ θ and τ θ first and eventually showing similar tradeoff characteristics as in the "independent" initialization schemes (first two columns).

S4. AUTOCORRELATION TIMES OF DIFFERENT OBSERVABLES
As explained in the main text (see Methods), different observables f (x) generally exhibit different integrated autocorrelation times τ (f ) θ . We recall that the latter are defined as where is the correlation function of f (x) for the Markov chain (S1) initialized in the stationary state x (0) ∼p θ (x) (cf. Eqs. (15) and (16) in the main text). The quantity τ θ from Eq. (6), which is our principal measure of sampling efficiency in the main text, is a weighted average of the autocorrelation times τ  (18)). In the following, we verify that the autocorrelation times τ (f ) θ typically scale similarly to τ θ . More precisely, they are found to be largely proportional to each other. For the power-law bound (7) which quantifies the accuracyefficiency tradeoff, this changes the constant c on the right-hand side to an observable-dependent c (f ) . Those constants can no longer be identified with the total correlation C tot (p) of the full target distribution p(x). Instead, the pertinent reference should be a characteristic of the distribution p(f (x)) of the transformed variables, for which, however, it may not always be possible or reasonable to define a "total correlation." In Fig. S4, we adopt the same setup and RBMs as in the third column of Fig. 2b of the main text (TFIC, σ z basis, n CD = 1, |S| = 25 000). We plot the integrated autocorrelation times τ (S18a) • the nearest-neighbor correlation function, • the next-nearest neighbor correlation function, (d,i) snapshot after t = 5000 training epochs of a machine initialized like in (a,f) and subsequently trained to learn the ground state for g = 1 and otherwise identical hyperparameters; (e,j) similar to (d,i), but using snapshots after t = 5000 training epochs when learning the g = 2 ground state. Data points show averages over 5 independent runs for each scheme. Fill colors indicate the total correlation Ctot(p θ ) of the model distribution (see colorbars), border colors and marker types indicate the number of hidden units N (see legend in (a)). Further hyperparameters: CD training with nCD = 1, η = 0.001, B = 100, |S| = 25 000.
• the 3-point nearest-neighbor correlation function, Note that, due to translational invariance of the target distribution p(x), all those observables should be independent of the reference index i ∈ {1, . . . , M }. However, the learned model distributionp θ (x) may not fully reflect this symmetry. The autocorrelation time τ (f ) θ shown in Fig. S4 is therefore averaged over all indices i. We also show the autocorrelation time for the general 2-point correlation function again averaged over all pairs (i 1 , i 2 ). (Note that this observable still depends on the difference i 1 −i 2 though.) Finally, we also include autocorrelation times for the mean over all visible units, We point out that there are RBM configurations for which we did not obtain a reliable estimate of τ θ or τ (f ) θ within the maximally admitted number of sampling steps (see also Methods in the main text). Therefore, the data points are sparser for N = 64, in particular. As mentioned above, the results indicate that τ (f ) θ is usually proportional to f (x). With regard to the seemingly largest (relative) deviations in the top-left panel, we observe that the autocorrelation times are generally very small in this case. We also remark that, due to translation invariance, the trained models should satisfy τ   tions from this ideal behavior can hint at overfitting and insufficient expressivity.
The same conclusions can be drawn from Fig. S5, which shows autocorrelation times for different observables in the digit-pattern example from Fig. 4a-c and Sec. S2 C. We remark that, due to the distinct geometry, some of the observables do not have the same physical meaning as in the TFIC example; for instance, x i x i+1 is a correlation function between nearest neighbors along the columns only.
Finally, for a more direct visualization of correlations between samples of Markov chains like (S1), we show snapshots from such chains for RBMs trained on the MNIST dataset in Fig. S6. The autocorrelation-time estimate τ θ conforms nicely with the number of steps needed to reach a sample that looks "new" or "uncorrelated" to the naked eye. The adequacy of τ θ to quantify correlations between Markov-chain samples and to estimate the additional steps required to obtain an independent sample is thus reinforced. 2 ] and subsequently thermalized for 2 × 10 6 steps before recording starts.

S5. EXTENDED MECHANISM
We expand on aspects of the discussion in the section "Mechanism" from the main text.

A. Minimal loss for independent units
As stated around Eq. (8) of the main text, the exact loss ∆ θ = D KL (p||p θ ) is bounded from below by the total correlation C tot (p) ifp θ consists of independent units. Indeed, ifp(x) := ip i (x i ), we can make the following decomposition: Recalling the definition (8) and the fact that D KL (p||q) ≥ 0 for arbitrary distributions p and q, we can conclude that D KL (p||p) ≥ C tot (p).

B. Hyperparameter independence of the tradeoff relation
In the main text, we argued that the relationship between ∆ θ and τ θ in the independent-and correlationlearning regimes is essentially independent of the basic RBM hyperparameters, including the learning rate η, the batch size B, the number of training samples |S|, the number of hidden units N , and the approximation scheme for model averages during training (CD vs. PCD and their order n CD ). Further evidence for this insensitivity is provided in Fig. S7 for the digit-generator example from Fig. 4a-c and Sec. S2 C.
We emphasize that the choice of appropriate hyperparameters is still important, because it affects the stability of training and the onset of the degradation regime, meaning that poor choices can lead to early deterioration of the RBMs.

C. Scaling of weights and correlations
The magnitude |θ k | of the RBM parameters typically grows during training. A distinctive property of many "real-world" machine-learning problems is that the target distribution is sparse, meaning that most states x have vanishing or at least very small probability p(x). For example, the overwhelming majority of all possible images with a given number of pixels will not display "realistic" motifs (e.g., digits, letters, animals, clothes, buildings, ...). This characteristic is at odds with the RBM model family, which assigns a finite probabilityp θ (x) > 0 to all states x. To suppress the unlikely states, many of the parameters θ k have to take large absolute values.
As argued around Eq. (11) and illustrated in Fig. S8, this usually increases the correlations between subsequent Markov-chain samples. Notably, correlations between visible and hidden units, and thus between two (or more) visible units, arise only if |w ij | > 0. Hence larger |θ k | and larger |w ij | in particular hint at larger autocorrelation times τ θ . Note, however, that |w ij | > 0 for some i, j is not sufficient to obtain correlations between different visible units; to this end, two visible units x i1 and x i2 must be coupled to the same hidden unit h j , i.e., both |w i1j | > 0 and |w i2j | > 0 must hold.
Since it is computationally demanding to estimate τ θ reliably, the standard deviation σ w := ( 1 M N −1 i,j w 2 ij ) 1/2 can be considered as a more accessible indicator of growing correlations in practice. In Fig. S8, we show the mutual dependence of τ θ and σ w for the examples from Sec. S2 A-S2 C and various numbers of hidden units N . We observe that the two quantities are indeed positively correlated, which confirms, in particular, that τ θ typically grows with the magnitude of the weights.
The necessity to increase the RBM parameters θ k in magnitude in order to represent sparse distributions with many "inactive" states x such that p(x) = 0 is also b "hook" pattern recognition c digit generator a TFIC ground-state tomography one possible hint at the source of the different α values observed in Fig. 2d compared with Fig. 2b ( and also Figs. 3b and 4b,c). The image distributions from Figs. 3a-c and 4b,c all have a large fraction of such "inactive" states (99.92 % and 99.88 %, respectively, see Secs. S2 B and S2 C). Likewise, the TFIC ground state in the σ z basis (when α = 1 2 ) has p(x) = 0 for half of the states by symmetry (see Sec. S2 A and S5 D), whereas no such inactive states exist in the σ x representation (when α = 6 . . . 8).
Within the examples from Fig. 2d, the case with g = 1 stands out because it also exhibits a pronounced intermediate stage in the correlation-learning regime where the relationship between ∆ θ and τ θ does not follow the power-law tradeoff of the global bound with α ≃ 6, but instead shows a stronger tradeoff with α ≃ 1 4 . A more detailed analysis (see Sec. S5 D) suggests that this is caused by the emerging strong bimodal structure of the target distribution as g becomes smaller, which impedes and eventually prevents efficient learning.

D. Basis-dependent learning characteristics in the transverse-field Ising chain
We investigate the emergence of an intermediate stage in the correlation-learning regime, where learning is less efficient than admitted by the global power-law tradeoff, in the example of ground-state tomography for the transverse-field Ising chain in the σ x basis as g becomes smaller (see, in particular, Fig. 2d of the main text).
Since there are some important differences in the learning characteristics of the TFIC ground-state distribution in the σ z and σ x bases (cf. Figs. 2b and d), we first examine the underlying target distribution p(x) in more detail. To avoid confusion in the following, we denote the target distribution as p z (x) when referring to the σ z basis representation and as p x (x) when referring to the σ x basis representation. The Hamiltonian (S7) has a Z 2 symmetry of the form F := i σ z i , i.e., [H, F ] = 0. In the σ z basis, F measures the parity of the number of down spins, whereas it describes a spin-flip symmetry in the σ x basis.
For even M , to which we restrict our discussion exclusively in this work, the ground state (S12) of the Hamiltonian (S7) lies in the F = +1 sector. In the σ z basis, this means that p z (x) = 0 for half of the basis states, namely all configurations x = (x 1 , . . . , x M ) for which i x i is odd. In the σ x basis, by contrast, it implies that This distinct manifestation of the symmetry F is the first important distinction between the σ z -and σ x -basis representations. Learning the σ x representation is thus aided by the fact that the RBM model can encode relative spin orientations (alignment or anti-alignment) particularly efficiently, requiring only a single hidden unit for an arbitrary combination of visible ones, although the effectively needed number of weights appears to depend also on the proximity to the critical point [S9]. By contrast, due to the exponential form of the model distribution, it is rather inefficient at enforcingp θ (x) = 0 for individual states x as required by the σ z representation p z (x). These observations hint at possible origins of the different exponents α in the tradeoff relation (7) for the two representations, a large value α ≃ 6 . . . 8 for the better suited p x (x) and a small value α ≃ 1 2 for the worse suited p z (x).
For large values of g, p z (x) is dominated by the state x = (0, . . . , 0) ("all spins up"), while p x (x) becomes approximately uniform. For small values of g, in turn, p z (x) approaches a uniform distribution, whereas p x (x) is dominated by the two states x = (0, . . . , 0) and x = (1, . . . , 1), whose probability is degenerate due to the symmetry F . The different mode structure in the nonuniform limit of the σ z -and σ x -basis representations is a second important difference between them. Strongly unimodal as well as approximately uniform distributions are structurally simple, and gradient-descent training can find a globally optimal solution relatively easily. If the distribution has two (or a few) modes or dominant states, by contrast, the risk of getting stuck in a local minimum of the loss landscape and detecting only a subset of those dominant states is increased.
We presume that the bimodal structure of p x (x), which becomes increasingly pronounced as g becomes smaller, is the reason for the intermediate stage in the correlationlearning regime observed in the top panel (g = 1) of Fig. 2d. To substantiate this claim, we investigate the evolution of the model probabilityp θ (x) during training for the two modes x + := (0, . . . , 0) and x − := (1, . . . , 1). Fig. S9 shows these probabilities for three different values of g.
We focus on the middle panel (g = 1) first, where p x (x ± ) ≈ 5.6 %. The end of the correlation-learning regime is marked by the blue dashed line (blue star in the inset). Thereafter, the RBM first picks up the x − mode, accompanied by a segment with α ≃ 6 (see also Fig. 2d in the main text). The probability of the other mode x + only starts to increase slightly later at around the orange mark, which also corresponds to the starting point of the second stage of correlation learning with α ≃ 1 4 . It lasts until both modes are established (green mark), and the subsequent fine tuning proceeds with a more efficient tradeoff exponent of α ≃ 6 again. A similar, but less pronounced behavior can be observed for g = 2 (right panel of Fig. S9), where the two modes only have p x (x ± ) ≈ 0.015 %. On the contrary, if the modes become more pronounced as exemplified by the g = 1 2 case with p x (x ± ) ≈ 36 % (left panel), the RBM only learns one of the modes and fails to detect the other one.